Back in 2014, fivethiryeight.com published an article on alchohol consumption in different countries. The data drinks is available as part of the fivethirtyeight package. Make sure you have installed the fivethirtyeight package before proceeding.
library(fivethirtyeight)
data(drinks)
# or download directly
# alcohol_direct <- read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/alcohol-consumption/drinks.csv")What are the variable types? Any missing values we should worry about?
No missing values.
| Column name | country | beer_servings | spirit_servings | wine_servings | total_litres_of_pure_alcohol |
|---|---|---|---|---|---|
| Variable type | character | integer | integer | integer | double |
glimpse(drinks)## Rows: 193
## Columns: 5
## $ country <chr> "Afghanistan", "Albania", "Algeria", "And…
## $ beer_servings <int> 0, 89, 25, 245, 217, 102, 193, 21, 261, 2…
## $ spirit_servings <int> 0, 132, 0, 138, 57, 128, 25, 179, 72, 75,…
## $ wine_servings <int> 0, 54, 14, 312, 45, 45, 221, 11, 212, 191…
## $ total_litres_of_pure_alcohol <dbl> 0.0, 4.9, 0.7, 12.4, 5.9, 4.9, 8.3, 3.8, …
skim(drinks)| Name | drinks |
| Number of rows | 193 |
| Number of columns | 5 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| numeric | 4 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| country | 0 | 1 | 3 | 28 | 0 | 193 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| beer_servings | 0 | 1 | 106.16 | 101.14 | 0 | 20.0 | 76.0 | 188.0 | 376.0 | ▇▃▂▂▁ |
| spirit_servings | 0 | 1 | 80.99 | 88.28 | 0 | 4.0 | 56.0 | 128.0 | 438.0 | ▇▃▂▁▁ |
| wine_servings | 0 | 1 | 49.45 | 79.70 | 0 | 1.0 | 8.0 | 59.0 | 370.0 | ▇▁▁▁▁ |
| total_litres_of_pure_alcohol | 0 | 1 | 4.72 | 3.77 | 0 | 1.3 | 4.2 | 7.2 | 14.4 | ▇▃▅▃▁ |
Make a plot that shows the top 25 beer consuming countries
drinks %>%
slice_max ( order_by = beer_servings, n=25 ) %>% # taking top 25 countries by servings
ggplot(aes(x = beer_servings, y = fct_reorder(country, beer_servings))) +
geom_col(fill="orange") +
labs(
title = "Top 25 Beer Serving Countries in 2010",
subtitle = "Standard Servings Per Person",
x = "Beer Servings (in cans)",
y = "Country"
)Make a plot that shows the top 25 wine consuming countries
drinks %>%
slice_max ( order_by = wine_servings, n=25 ) %>% # taking top 25 countries by servings
ggplot(aes(x = wine_servings, y = fct_reorder(country, wine_servings))) +
geom_col(fill="dark red") +
labs(
title = "Top 25 Wine Serving Countries in 2010",
subtitle = "Standard Servings Per Person",
x = "Wine Servings (in glasses)",
y = "Country"
)Finally, make a plot that shows the top 25 spirit consuming countries
drinks %>%
slice_max ( order_by = spirit_servings, n=25 ) %>% # taking top 25 countries by servings
ggplot(aes(x = spirit_servings, y = fct_reorder(country, spirit_servings))) +
geom_col(fill="grey") +
labs(
title = "Top 25 Spirit Serving Countries in 2010",
subtitle = "Servings (in shots) Per Person",
x = "Spirit Servings",
y = "Country"
)What can you infer from these plots? Don’t just explain what’s in the graph, but speculate or tell a short story (1-2 paragraphs max).
TYPE YOUR ANSWER AFTER (AND OUTSIDE!) THIS BLOCKQUOTE.
European countries are high consumers of wine.
Beer is more evenly distributed around the world in the top 25 countries, as compared to wine and spirit.
European countries are higher ranked for overall consumption of beer, wine and spirit.
We will look at a subset sample of movies, taken from the Kaggle IMDB 5000 movie dataset
Besides the obvious variables of title, genre, director, year, and duration, the rest of the variables are as follows:
gross : The gross earnings in the US box office, not adjusted for inflationbudget: The movie’s budgetcast_facebook_likes: the number of facebook likes cast members receivedvotes: the number of people who voted for (or rated) the movie in IMDBreviews: the number of reviews for that movierating: IMDB average ratingmovies <- read_csv(here::here("data", "movies.csv"))
glimpse(movies)## Rows: 2,961
## Columns: 11
## $ title <chr> "Avatar", "Titanic", "Jurassic World", "The Avenge…
## $ genre <chr> "Action", "Drama", "Action", "Action", "Action", "…
## $ director <chr> "James Cameron", "James Cameron", "Colin Trevorrow…
## $ year <dbl> 2009, 1997, 2015, 2012, 2008, 1999, 1977, 2015, 20…
## $ duration <dbl> 178, 194, 124, 173, 152, 136, 125, 141, 164, 93, 1…
## $ gross <dbl> 7.61e+08, 6.59e+08, 6.52e+08, 6.23e+08, 5.33e+08, …
## $ budget <dbl> 2.37e+08, 2.00e+08, 1.50e+08, 2.20e+08, 1.85e+08, …
## $ cast_facebook_likes <dbl> 4834, 45223, 8458, 87697, 57802, 37723, 13485, 920…
## $ votes <dbl> 886204, 793059, 418214, 995415, 1676169, 534658, 9…
## $ reviews <dbl> 3777, 2843, 1934, 2425, 5312, 3917, 1752, 1752, 35…
## $ rating <dbl> 7.9, 7.7, 7.0, 8.1, 9.0, 6.5, 8.7, 7.5, 8.5, 7.2, …
# no missing values. There are Duplicates (2907 distinct titles in 2961 rows).
skim(movies)| Name | movies |
| Number of rows | 2961 |
| Number of columns | 11 |
| _______________________ | |
| Column type frequency: | |
| character | 3 |
| numeric | 8 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| title | 0 | 1 | 1 | 83 | 0 | 2907 | 0 |
| genre | 0 | 1 | 5 | 11 | 0 | 17 | 0 |
| director | 0 | 1 | 3 | 32 | 0 | 1366 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| year | 0 | 1 | 2.00e+03 | 9.95e+00 | 1920.0 | 2.00e+03 | 2.00e+03 | 2.01e+03 | 2.02e+03 | ▁▁▁▂▇ |
| duration | 0 | 1 | 1.10e+02 | 2.22e+01 | 37.0 | 9.50e+01 | 1.06e+02 | 1.19e+02 | 3.30e+02 | ▃▇▁▁▁ |
| gross | 0 | 1 | 5.81e+07 | 7.25e+07 | 703.0 | 1.23e+07 | 3.47e+07 | 7.56e+07 | 7.61e+08 | ▇▁▁▁▁ |
| budget | 0 | 1 | 4.06e+07 | 4.37e+07 | 218.0 | 1.10e+07 | 2.60e+07 | 5.50e+07 | 3.00e+08 | ▇▂▁▁▁ |
| cast_facebook_likes | 0 | 1 | 1.24e+04 | 2.05e+04 | 0.0 | 2.24e+03 | 4.60e+03 | 1.69e+04 | 6.57e+05 | ▇▁▁▁▁ |
| votes | 0 | 1 | 1.09e+05 | 1.58e+05 | 5.0 | 1.99e+04 | 5.57e+04 | 1.33e+05 | 1.69e+06 | ▇▁▁▁▁ |
| reviews | 0 | 1 | 5.03e+02 | 4.94e+02 | 2.0 | 1.99e+02 | 3.64e+02 | 6.31e+02 | 5.31e+03 | ▇▁▁▁▁ |
| rating | 0 | 1 | 6.39e+00 | 1.05e+00 | 1.6 | 5.80e+00 | 6.50e+00 | 7.10e+00 | 9.30e+00 | ▁▁▆▇▁ |
# show the duplicate movies
movies %>% count(title, sort=T)## # A tibble: 2,907 × 2
## title n
## <chr> <int>
## 1 Home 3
## 2 A Nightmare on Elm Street 2
## 3 Across the Universe 2
## 4 Alice in Wonderland 2
## 5 Aloha 2
## 6 Around the World in 80 Days 2
## 7 Brothers 2
## 8 Carrie 2
## 9 Chasing Liberty 2
## 10 Cinderella 2
## # … with 2,897 more rows
# to see what happens with the duplicates
movies %>% filter(title=="Homes")## # A tibble: 0 × 11
## # … with 11 variables: title <chr>, genre <chr>, director <chr>, year <dbl>,
## # duration <dbl>, gross <dbl>, budget <dbl>, cast_facebook_likes <dbl>,
## # votes <dbl>, reviews <dbl>, rating <dbl>
# `distinct` function can only keep the first entry but not latest
# movies <- distinct(movies, title, .keep_all=T)
length(unique(movies$title))## [1] 2907
movies <- movies %>%
group_by(title) %>%
filter(votes == max(votes)) %>%
ungroup()
# there are still duplicates
movies %>% count(title, sort=T)## # A tibble: 2,907 × 2
## title n
## <chr> <int>
## 1 Chasing Liberty 2
## 2 10 Cloverfield Lane 1
## 3 10 Days in a Madhouse 1
## 4 10 Things I Hate About You 1
## 5 102 Dalmatians 1
## 6 10th & Wolf 1
## 7 12 Rounds 1
## 8 12 Years a Slave 1
## 9 127 Hours 1
## 10 13 Going on 30 1
## # … with 2,897 more rows
# to see what happens with the duplicates
movies %>% filter(title=="Chasing Liberty")## # A tibble: 2 × 11
## title genre director year duration gross budget cast_facebook_l… votes
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Chasing… Comedy Andy Cad… 2004 101 1.22e7 2.3e7 842 30092
## 2 Chasing… Comedy Andy Cad… 2004 101 1.22e7 2.3e7 829 30092
## # … with 2 more variables: reviews <dbl>, rating <dbl>
# do the filter only for the entries of Chasing Liberty
movies <- movies %>%
group_by(title) %>%
filter(cast_facebook_likes==max(cast_facebook_likes)) %>%
ungroup()
skim(movies)| Name | movies |
| Number of rows | 2907 |
| Number of columns | 11 |
| _______________________ | |
| Column type frequency: | |
| character | 3 |
| numeric | 8 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| title | 0 | 1 | 1 | 83 | 0 | 2907 | 0 |
| genre | 0 | 1 | 5 | 11 | 0 | 17 | 0 |
| director | 0 | 1 | 3 | 32 | 0 | 1366 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| year | 0 | 1 | 2.00e+03 | 9.92e+00 | 1920.0 | 2.00e+03 | 2.00e+03 | 2.01e+03 | 2.02e+03 | ▁▁▁▂▇ |
| duration | 0 | 1 | 1.10e+02 | 2.23e+01 | 37.0 | 9.50e+01 | 1.05e+02 | 1.19e+02 | 3.30e+02 | ▃▇▁▁▁ |
| gross | 0 | 1 | 5.76e+07 | 7.23e+07 | 703.0 | 1.20e+07 | 3.45e+07 | 7.51e+07 | 7.61e+08 | ▇▁▁▁▁ |
| budget | 0 | 1 | 4.02e+07 | 4.32e+07 | 218.0 | 1.10e+07 | 2.50e+07 | 5.50e+07 | 3.00e+08 | ▇▂▁▁▁ |
| cast_facebook_likes | 0 | 1 | 1.23e+04 | 2.05e+04 | 0.0 | 2.22e+03 | 4.54e+03 | 1.68e+04 | 6.57e+05 | ▇▁▁▁▁ |
| votes | 0 | 1 | 1.09e+05 | 1.59e+05 | 5.0 | 1.95e+04 | 5.47e+04 | 1.32e+05 | 1.69e+06 | ▇▁▁▁▁ |
| reviews | 0 | 1 | 4.98e+02 | 4.93e+02 | 2.0 | 1.97e+02 | 3.58e+02 | 6.24e+02 | 5.31e+03 | ▇▁▁▁▁ |
| rating | 0 | 1 | 6.39e+00 | 1.06e+00 | 1.6 | 5.80e+00 | 6.50e+00 | 7.10e+00 | 9.30e+00 | ▁▁▆▇▁ |
movies %>% count(genre, sort = TRUE)## # A tibble: 17 × 2
## genre n
## <chr> <int>
## 1 Comedy 844
## 2 Action 719
## 3 Drama 484
## 4 Adventure 281
## 5 Crime 198
## 6 Biography 135
## 7 Horror 128
## 8 Animation 35
## 9 Fantasy 26
## 10 Documentary 25
## 11 Mystery 15
## 12 Sci-Fi 7
## 13 Family 3
## 14 Musical 2
## 15 Romance 2
## 16 Western 2
## 17 Thriller 1
gross and budget) by genre. Calculate a variable return_on_budget which shows how many $ did a movie make at the box office for each $ of its budget. Ranked genres by this return_on_budget in descending ordermovies %>%
mutate(movies_return = gross/budget ) %>%
group_by(genre) %>%
summarise(avg_gross = mean(gross),
avg_budget = mean(budget),
genre_return_on_budget = sum(gross)/sum(budget),
movie_mean_return_on_budget = mean(movies_return)) %>%
arrange(-movie_mean_return_on_budget)## # A tibble: 17 × 5
## genre avg_gross avg_budget genre_return_on_budget movie_mean_return_o…
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Horror 37782310. 13804379. 2.74 86.1
## 2 Biography 45201805. 28543696. 1.58 22.3
## 3 Musical 92084000 3189500 28.9 18.8
## 4 Family 149160478. 14833333. 10.1 14.1
## 5 Documentary 17353973. 5887852. 2.95 8.70
## 6 Western 20821884 3465000 6.01 7.06
## 7 Fantasy 41902674. 18484615. 2.27 6.10
## 8 Animation 98433792. 61701429. 1.60 5.01
## 9 Comedy 42487808. 24458506. 1.74 3.70
## 10 Romance 31264848. 25107500 1.25 3.17
## 11 Drama 36754959. 25832605. 1.42 2.98
## 12 Mystery 69117136. 41500000 1.67 2.90
## 13 Adventure 94350236. 64692313. 1.46 2.44
## 14 Crime 37601525. 26527405. 1.42 2.19
## 15 Action 86270343. 70774558. 1.22 1.93
## 16 Sci-Fi 29788371. 27607143. 1.08 1.58
## 17 Thriller 2468 300000 0.00823 0.00823
movies %>%
group_by(director) %>%
summarise(total_gross = sum(gross),
mean_gross = mean(gross),
median_gross = median(gross),
standard_dev_gross = sd(gross)) %>%
slice_max ( order_by = total_gross, n = 15)## # A tibble: 15 × 5
## director total_gross mean_gross median_gross standard_dev_gross
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Steven Spielberg 4014061704 174524422. 164435221 101421051.
## 2 Michael Bay 2195443511 182953626. 168468240. 125789167.
## 3 James Cameron 1909725910 318287652. 175562880. 309171337.
## 4 Christopher Nolan 1813227576 226653447 196667606. 187224133.
## 5 George Lucas 1741418480 348283696 380262555 146193880.
## 6 Robert Zemeckis 1619309108 124562239. 100853835 91300279.
## 7 Tim Burton 1557078534 111219895. 69791834 99304293.
## 8 Sam Raimi 1443167519 180395940. 138480208 174705230.
## 9 Clint Eastwood 1378321100 72543216. 46700000 75487408.
## 10 Francis Lawrence 1358501971 271700394. 281666058 135437020.
## 11 Ron Howard 1335988092 111332341 101587923 81933761.
## 12 Gore Verbinski 1329600995 189942999. 123207194 154473822.
## 13 Andrew Adamson 1137446920 284361730 279680930. 120895765.
## 14 Shawn Levy 1129750988 102704635. 85463309 65484773.
## 15 Ridley Scott 1128857598 80632686. 47775715 68812285.
movies_rating <- movies %>%
group_by(genre) %>%
summarise(mean_rating = mean(rating),
min_rating = min(rating),
max_rating = max(rating),
sd_rating = sd(rating))
movies_rating## # A tibble: 17 × 5
## genre mean_rating min_rating max_rating sd_rating
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Action 6.23 2.1 9 1.04
## 2 Adventure 6.51 2.3 8.6 1.11
## 3 Animation 6.65 4.5 8 0.968
## 4 Biography 7.11 4.5 8.9 0.760
## 5 Comedy 6.11 1.9 8.8 1.02
## 6 Crime 6.92 4.8 9.3 0.853
## 7 Documentary 6.66 1.6 8.5 1.77
## 8 Drama 6.74 2.1 8.8 0.915
## 9 Family 6.5 5.7 7.9 1.22
## 10 Fantasy 6.08 4.3 7.9 0.953
## 11 Horror 5.79 3.6 8.5 0.987
## 12 Musical 6.75 6.3 7.2 0.636
## 13 Mystery 6.84 4.6 8.5 0.910
## 14 Romance 6.65 6.2 7.1 0.636
## 15 Sci-Fi 6.66 5 8.2 1.09
## 16 Thriller 4.8 4.8 4.8 NA
## 17 Western 5.7 4.1 7.3 2.26
movies %>%
ggplot(mapping = aes(x = rating)) +
geom_histogram(bins=30) +
facet_wrap(~genre)+
labs(title = "Distribution of ratings in each genre",
x = "Rating (1-10)",
y = "Num of movies") NULL## NULL
ggplot to answer the followinggross and cast_facebook_likes. Produce a scatterplot and write one sentence discussing whether the number of facebook likes that the cast has received is likely to be a good predictor of how much money a movie will make at the box office. What variable are you going to map to the Y- and X- axes?ggplot(movies, aes(x = cast_facebook_likes, y = gross)) +
geom_point() +
geom_smooth(method = "lm")+
NULLgross and budget. Produce a scatterplot and write one sentence discussing whether budget is likely to be a good predictor of how much money a movie will make at the box office.ggplot(movies, aes(x = budget , y = gross)) +
geom_point() +
geom_smooth(method = "lm")The budget and gross do seem correlated. The higher the budget, it is more likely that the gross may be higher.
Examine the relationship between gross and rating. Produce a scatterplot, faceted by genre and discuss whether IMDB ratings are likely to be a good predictor of how much money a movie will make at the box office. Is there anything strange in this dataset?
ggplot(movies, aes(x = rating , y = gross)) +
geom_point() +
geom_smooth(method = "lm") +
facet_wrap(~genre) +
labs(title = "Gross vs Rating of Movies For Each Genre ",
x = "Rating",
y = "Gross") +
NULLWe can see that:
The higher the rating the more will be the gross for the most genres of movies.
For movies of some genres like ‘Documentary’, ‘Mystery’, ‘Horror’ and ‘Sci-Fi’, the gross has a very less change with respect to rating. Documentaries certainly have a different business model.
Negative correlation even appears.
You may find useful the material on finance data sources.
We will use the tidyquant package to download historical data of stock prices, calculate returns, and examine the distribution of returns.
We must first identify which stocks we want to download data for, and for this we must know their ticker symbol; Apple is known as AAPL, Microsoft as MSFT, McDonald’s as MCD, etc. The file nyse.csv contains 508 stocks listed on the NYSE, their ticker symbol, name, the IPO (Initial Public Offering) year, and the sector and industry the company is in.
nyse <- read_csv(here::here("data","nyse.csv"))
glimpse(nyse)## Rows: 508
## Columns: 6
## $ symbol <chr> "MMM", "ABB", "ABT", "ABBV", "ACN", "AAP", "AFL", "A", "…
## $ name <chr> "3M Company", "ABB Ltd", "Abbott Laboratories", "AbbVie …
## $ ipo_year <chr> "n/a", "n/a", "n/a", "2012", "2001", "n/a", "n/a", "1999…
## $ sector <chr> "Health Care", "Consumer Durables", "Health Care", "Heal…
## $ industry <chr> "Medical/Dental Instruments", "Electrical Products", "Ma…
## $ summary_quote <chr> "https://www.nasdaq.com/symbol/mmm", "https://www.nasdaq…
nyse$sector## [1] "Health Care" "Consumer Durables" "Health Care"
## [4] "Health Care" "Miscellaneous" "Consumer Services"
## [7] "Finance" "Capital Goods" "Basic Industries"
## [10] "Basic Industries" "Health Care" "Consumer Services"
## [13] "Miscellaneous" "Finance" "Health Care"
## [16] "Finance" "Finance" "Consumer Services"
## [19] "Consumer Non-Durables" "Consumer Non-Durables" "Consumer Durables"
## [22] "Public Utilities" "Public Utilities" "Public Utilities"
## [25] "Public Utilities" "Finance" "Finance"
## [28] "Consumer Services" "Public Utilities" "Finance"
## [31] "Health Care" "Capital Goods" "Consumer Durables"
## [34] "Consumer Non-Durables" "Consumer Services" "Health Care"
## [37] "Finance" "Finance" "Capital Goods"
## [40] "Consumer Services" "Basic Industries" "Consumer Non-Durables"
## [43] "Capital Goods" "Technology" "Finance"
## [46] "Health Care" "Public Utilities" "Public Utilities"
## [49] "Technology" "Consumer Services" "Consumer Services"
## [52] "Public Utilities" "Finance" "Energy"
## [55] "Consumer Durables" "Finance" "Finance"
## [58] "Finance" "Finance" "Finance"
## [61] "Finance" "Finance" "Finance"
## [64] "Finance" "Finance" "Finance"
## [67] "Finance" "Finance" "Basic Industries"
## [70] "Health Care" "Finance" "Public Utilities"
## [73] "Health Care" "Consumer Services" "Basic Industries"
## [76] "Energy" "Capital Goods" "Finance"
## [79] "Capital Goods" "Consumer Services" "Consumer Services"
## [82] "Health Care" "Energy" "Health Care"
## [85] "Consumer Non-Durables" "Miscellaneous" "Consumer Services"
## [88] "Consumer Services" "Finance" "Public Utilities"
## [91] "Consumer Services" "Energy" "Consumer Services"
## [94] "Consumer Non-Durables" "Finance" "Transportation"
## [97] "Energy" "Transportation" "Miscellaneous"
## [100] "Finance" "Health Care" "Consumer Durables"
## [103] "Consumer Services" "Consumer Services" "Consumer Durables"
## [106] "Capital Goods" "Finance" "Consumer Services"
## [109] "Basic Industries" "Energy" "Health Care"
## [112] "Public Utilities" "Public Utilities" "Public Utilities"
## [115] "Basic Industries" "Consumer Services" "Energy"
## [118] "Consumer Services" "Finance" "Public Utilities"
## [121] "Energy" "Public Utilities" "Public Utilities"
## [124] "Consumer Services" "Finance" "Public Utilities"
## [127] "Basic Industries" "Health Care" "Finance"
## [130] "Finance" "Consumer Durables" "Public Utilities"
## [133] "Finance" "Capital Goods" "Consumer Non-Durables"
## [136] "Consumer Non-Durables" "Consumer Non-Durables" "Consumer Non-Durables"
## [139] "Consumer Non-Durables" "Energy" "Energy"
## [142] "Public Utilities" "Consumer Non-Durables" "Energy"
## [145] "Basic Industries" "Consumer Non-Durables" "Finance"
## [148] "Finance" "Capital Goods" "Consumer Services"
## [151] "Energy" "Health Care" "Capital Goods"
## [154] "Capital Goods" "Consumer Services" "Capital Goods"
## [157] "Technology" "Transportation" "Finance"
## [160] "Consumer Non-Durables" "Consumer Services" "Finance"
## [163] "Consumer Services" "Public Utilities" "Technology"
## [166] "Basic Industries" "Public Utilities" "Public Utilities"
## [169] "Consumer Services" "Basic Industries" "Technology"
## [172] "Basic Industries" "Energy" "Public Utilities"
## [175] "Health Care" "Health Care" "Health Care"
## [178] "Energy" "Energy" "Public Utilities"
## [181] "Public Utilities" "Energy" "Public Utilities"
## [184] "Public Utilities" "Energy" "Technology"
## [187] "Finance" "Energy" "Consumer Services"
## [190] "Consumer Services" "Consumer Services" "Consumer Non-Durables"
## [193] "Finance" "Public Utilities" "Public Utilities"
## [196] "Public Utilities" "Consumer Services" "Energy"
## [199] "Technology" "Miscellaneous" "Consumer Services"
## [202] "Transportation" "Capital Goods" "Capital Goods"
## [205] "Finance" "Miscellaneous" "Public Utilities"
## [208] "Miscellaneous" "Basic Industries" "Consumer Non-Durables"
## [211] "Capital Goods" "Public Utilities" "Capital Goods"
## [214] "Basic Industries" "Finance" "Basic Industries"
## [217] "Health Care" "Consumer Services" "Capital Goods"
## [220] "Energy" "Consumer Non-Durables" "Capital Goods"
## [223] "Capital Goods" "Health Care" "Miscellaneous"
## [226] "Finance" "Technology" "Finance"
## [229] "Energy" "Finance" "Health Care"
## [232] "Consumer Services" "Finance" "Capital Goods"
## [235] "Consumer Non-Durables" "Energy" "Technology"
## [238] "Consumer Services" "Consumer Services" "Capital Goods"
## [241] "Capital Goods" "Consumer Non-Durables" "Consumer Services"
## [244] "Technology" "Finance" "Health Care"
## [247] "Finance" "Capital Goods" "Technology"
## [250] "Technology" "Technology" "Finance"
## [253] "Capital Goods" "Finance" "Consumer Services"
## [256] "Technology" "Basic Industries" "Basic Industries"
## [259] "Finance" "Health Care" "Finance"
## [262] "Finance" "Consumer Non-Durables" "Basic Industries"
## [265] "Health Care" "Consumer Services" "Transportation"
## [268] "Finance" "Consumer Non-Durables" "Consumer Non-Durables"
## [271] "Finance" "Capital Goods" "Consumer Durables"
## [274] "Public Utilities" "Finance" "Energy"
## [277] "Public Utilities" "Consumer Services" "Capital Goods"
## [280] "Health Care" "Consumer Non-Durables" "Consumer Services"
## [283] "Technology" "Basic Industries" "Finance"
## [286] "Basic Industries" "Consumer Services" "Finance"
## [289] "Capital Goods" "Finance" "Consumer Services"
## [292] "Basic Industries" "Finance" "Energy"
## [295] "Capital Goods" "Finance" "Energy"
## [298] "Energy" "Finance" "Finance"
## [301] "Basic Industries" "Basic Industries" "Miscellaneous"
## [304] "Consumer Non-Durables" "Consumer Services" "Health Care"
## [307] "Health Care" "Health Care" "Finance"
## [310] "Capital Goods" "Consumer Services" "Consumer Services"
## [313] "Finance" "Finance" "Consumer Non-Durables"
## [316] "Finance" "Finance" "Technology"
## [319] "Energy" "Miscellaneous" "Public Utilities"
## [322] "Consumer Services" "Basic Industries" "Public Utilities"
## [325] "Consumer Non-Durables" "Public Utilities" "Energy"
## [328] "Technology" "Finance" "Transportation"
## [331] "Capital Goods" "Consumer Services" "Health Care"
## [334] "Health Care" "Basic Industries" "Basic Industries"
## [337] "Capital Goods" "Energy" "Technology"
## [340] "Public Utilities" "Technology" "Public Utilities"
## [343] "Finance" "Technology" "Technology"
## [346] "Capital Goods" "Technology" "Energy"
## [349] "Energy" "Energy" "Health Care"
## [352] "Consumer Non-Durables" "Energy" "Energy"
## [355] "Public Utilities" "Technology" "Energy"
## [358] "Energy" "Finance" "Basic Industries"
## [361] "Basic Industries" "Public Utilities" "Basic Industries"
## [364] "Finance" "Consumer Services" "Finance"
## [367] "Finance" "Public Utilities" "Public Utilities"
## [370] "Consumer Services" "Health Care" "Finance"
## [373] "Capital Goods" "Consumer Services" "Finance"
## [376] "Consumer Services" "Public Utilities" "Health Care"
## [379] "Consumer Services" "Technology" "Basic Industries"
## [382] "Capital Goods" "Consumer Services" "Finance"
## [385] "Capital Goods" "Finance" "Finance"
## [388] "Consumer Services" "Finance" "Technology"
## [391] "Technology" "Energy" "Energy"
## [394] "Consumer Services" "Public Utilities" "Technology"
## [397] "Consumer Services" "Finance" "Technology"
## [400] "Consumer Services" "Public Utilities" "Consumer Services"
## [403] "Health Care" "Technology" "Consumer Non-Durables"
## [406] "Public Utilities" "Basic Industries" "Transportation"
## [409] "Consumer Services" "Public Utilities" "Technology"
## [412] "Capital Goods" "Finance" "Health Care"
## [415] "Technology" "Health Care" "Finance"
## [418] "Consumer Services" "Finance" "Energy"
## [421] "Finance" "Finance" "Consumer Non-Durables"
## [424] "Technology" "Health Care" "Consumer Services"
## [427] "Consumer Services" "Public Utilities" "Technology"
## [430] "Energy" "Capital Goods" "Health Care"
## [433] "Public Utilities" "Public Utilities" "Public Utilities"
## [436] "Basic Industries" "Consumer Services" "Capital Goods"
## [439] "Basic Industries" "Finance" "Finance"
## [442] "Health Care" "Finance" "Capital Goods"
## [445] "Consumer Services" "Consumer Services" "Consumer Services"
## [448] "Finance" "Energy" "Miscellaneous"
## [451] "Capital Goods" "Capital Goods" "Finance"
## [454] "Technology" "Technology" "Consumer Non-Durables"
## [457] "Finance" "Technology" "Finance"
## [460] "Consumer Services" "Public Utilities" "Basic Industries"
## [463] "Basic Industries" "Transportation" "Transportation"
## [466] "Capital Goods" "Health Care" "Health Care"
## [469] "Consumer Non-Durables" "Basic Industries" "Energy"
## [472] "Technology" "Consumer Services" "Public Utilities"
## [475] "Consumer Services" "Miscellaneous" "Public Utilities"
## [478] "Technology" "Consumer Services" "Basic Industries"
## [481] "Consumer Services" "Finance" "Consumer Services"
## [484] "Consumer Services" "Consumer Services" "Public Utilities"
## [487] "Public Utilities" "Capital Goods" "Consumer Services"
## [490] "Public Utilities" "Health Care" "Finance"
## [493] "Consumer Services" "Basic Industries" "Public Utilities"
## [496] "Capital Goods" "Finance" "Consumer Services"
## [499] "Basic Industries" "Public Utilities" "Technology"
## [502] "Technology" "Capital Goods" "Consumer Services"
## [505] "Consumer Services" "Health Care" "Health Care"
## [508] "Transportation"
Based on this dataset, create a table and a bar plot that shows the number of companies per sector, in descending order
# a easier way
# nyse %>%
# select(sector) %>%
# table() %>%
# sort(decreasing = T) %>%
# barplot()
nyse %>%
group_by(sector) %>%
mutate(company_num = count(sector)) %>%
ggplot(aes(x=company_num, y=fct_reorder(sector, company_num))) +
geom_bar(stat="identity") +
labs(title = "Number of Companies in each Sector",
x = "",
y = "Sector")Next, let’s choose some stocks and their ticker symbols and download some data. You MUST choose 6 different stocks from the ones listed below; You should, however, add SPY which is the SP500 ETF (Exchange Traded Fund).
# Notice the cache=TRUE argument inthe chunk options. Because getting data is time consuming,
# cache=TRUE means that once it downloads data, the chunk will not run again next time you knit your Rmd
myStocks <- c("EBR","BBL","AEE","BCE","BRO","CAT","BUD","SPY" ) %>%
tq_get(get = "stock.prices",
from = "2011-01-01",
to = "2021-08-31") %>%
group_by(symbol)
glimpse(myStocks) # examine the structure of the resulting data frame## Rows: 21,464
## Columns: 8
## Groups: symbol [8]
## $ symbol <chr> "EBR", "EBR", "EBR", "EBR", "EBR", "EBR", "EBR", "EBR", "EBR"…
## $ date <date> 2011-01-03, 2011-01-04, 2011-01-05, 2011-01-06, 2011-01-07, …
## $ open <dbl> 13.9, 14.1, 14.3, 14.3, 14.0, 13.8, 13.8, 13.8, 13.9, 14.0, 1…
## $ high <dbl> 14.0, 14.3, 14.5, 14.3, 14.0, 13.9, 13.8, 13.9, 14.1, 14.3, 1…
## $ low <dbl> 13.9, 14.0, 14.3, 14.0, 13.8, 13.7, 13.6, 13.7, 13.8, 13.9, 1…
## $ close <dbl> 14.0, 14.2, 14.4, 14.1, 14.0, 13.7, 13.7, 13.8, 14.0, 14.2, 1…
## $ volume <dbl> 878300, 802900, 950300, 855800, 586000, 1020300, 1545800, 112…
## $ adjusted <dbl> 9.85, 10.00, 10.13, 9.92, 9.82, 9.63, 9.65, 9.71, 9.84, 10.03…
We can see that the dataset is 8x21,464 tibble with each row being the ohlc (open,high,low,close) and volume for a stock on a given date.
Financial performance analysis depend on returns; If I buy a stock today for 100 and I sell it tomorrow for 101.75, my one-day return, assuming no transaction costs, is 1.75%. So given the adjusted closing prices, our first step is to calculate daily and monthly returns.
#calculate daily returns
myStocks_returns_daily <- myStocks %>%
tq_transmute(select = adjusted,
mutate_fun = periodReturn,
period = "daily",
type = "log",
col_rename = "daily_returns",
cols = c(nested.col))
#calculate monthly returns
myStocks_returns_monthly <- myStocks %>%
tq_transmute(select = adjusted,
mutate_fun = periodReturn,
period = "monthly",
type = "arithmetic",
col_rename = "monthly_returns",
cols = c(nested.col))
#calculate yearly returns
myStocks_returns_annual <- myStocks %>%
group_by(symbol) %>%
tq_transmute(select = adjusted,
mutate_fun = periodReturn,
period = "yearly",
type = "arithmetic",
col_rename = "yearly_returns",
cols = c(nested.col))Create a table where you summarise monthly returns for each of the stocks and SPY; min, max, median, mean, SD.
glimpse(myStocks_returns_monthly)## Rows: 1,024
## Columns: 3
## Groups: symbol [8]
## $ symbol <chr> "EBR", "EBR", "EBR", "EBR", "EBR", "EBR", "EBR", "EBR"…
## $ date <date> 2011-01-31, 2011-02-28, 2011-03-31, 2011-04-29, 2011-…
## $ monthly_returns <dbl> -0.0264, 0.0487, 0.0861, -0.0445, -0.0263, -0.0264, -0…
monthlystocks_summarised <- myStocks_returns_monthly %>%
group_by(symbol) %>%
summarise(min_return = min(monthly_returns),
max_return = max(monthly_returns),
median_return = median(monthly_returns),
mean_return = mean(monthly_returns),
sd_return = sd(monthly_returns))Plot a density plot, using geom_density(), for each of the stocks
ggplot(myStocks_returns_monthly, aes(x = monthly_returns)) +
geom_density() +
facet_wrap(~symbol) +
labs(
title = "Density Plots for Monthly Returns of Stocks",
x = "Monthly Return",
y = "Density"
) +
NULLWhat can you infer from this plot? Which stock is the riskiest? The least risky?
TYPE YOUR ANSWER AFTER (AND OUTSIDE!) THIS BLOCKQUOTE.
The monthly returns for the flatter density plots are more dispersed whereas those with tall peaks are more concentrated. The riskiest stock is EBR and the least risky is SPY (as an ETF) due to the shape of their peaks.
Finally, make a plot that shows the expected monthly return (mean) of a stock on the Y axis and the risk (standard deviation) in the X-axis. Please use ggrepel::geom_text_repel() to label each stock
monthlystocks_summarised%>%
ggplot(aes(y = mean_return, x=sd_return)) +
geom_point() +
ggrepel::geom_text_repel(aes(label = symbol)) +
NULLWhat can you infer from this plot? Are there any stocks which, while being riskier, do not have a higher expected return?
TYPE YOUR ANSWER AFTER (AND OUTSIDE!) THIS BLOCKQUOTE.
EBR is the most risky as it has the highest standard deviation of returns. EBR has the highest sd but also a high expected return. AEE, CAT and SPY also have high expected returns with less standard deviation. BBL and BUD have lower expected returns and higher standard deviations meaning they are riskier and do not have high expected returns. SPY produces a good return with lower risk.
For this task, you will analyse a data set on Human Resource Analytics. The IBM HR Analytics Employee Attrition & Performance data set is a fictional data set created by IBM data scientists. Among other things, the data set includes employees’ income, their distance from work, their position in the company, their level of education, etc. A full description can be found on the website.
First let us load the data
hr_dataset <- read_csv(here::here("data", "datasets_1067_1925_WA_Fn-UseC_-HR-Employee-Attrition.csv"))
glimpse(hr_dataset)## Rows: 1,470
## Columns: 35
## $ Age <dbl> 41, 49, 37, 33, 27, 32, 59, 30, 38, 36, 35, 2…
## $ Attrition <chr> "Yes", "No", "Yes", "No", "No", "No", "No", "…
## $ BusinessTravel <chr> "Travel_Rarely", "Travel_Frequently", "Travel…
## $ DailyRate <dbl> 1102, 279, 1373, 1392, 591, 1005, 1324, 1358,…
## $ Department <chr> "Sales", "Research & Development", "Research …
## $ DistanceFromHome <dbl> 1, 8, 2, 3, 2, 2, 3, 24, 23, 27, 16, 15, 26, …
## $ Education <dbl> 2, 1, 2, 4, 1, 2, 3, 1, 3, 3, 3, 2, 1, 2, 3, …
## $ EducationField <chr> "Life Sciences", "Life Sciences", "Other", "L…
## $ EmployeeCount <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ EmployeeNumber <dbl> 1, 2, 4, 5, 7, 8, 10, 11, 12, 13, 14, 15, 16,…
## $ EnvironmentSatisfaction <dbl> 2, 3, 4, 4, 1, 4, 3, 4, 4, 3, 1, 4, 1, 2, 3, …
## $ Gender <chr> "Female", "Male", "Male", "Female", "Male", "…
## $ HourlyRate <dbl> 94, 61, 92, 56, 40, 79, 81, 67, 44, 94, 84, 4…
## $ JobInvolvement <dbl> 3, 2, 2, 3, 3, 3, 4, 3, 2, 3, 4, 2, 3, 3, 2, …
## $ JobLevel <dbl> 2, 2, 1, 1, 1, 1, 1, 1, 3, 2, 1, 2, 1, 1, 1, …
## $ JobRole <chr> "Sales Executive", "Research Scientist", "Lab…
## $ JobSatisfaction <dbl> 4, 2, 3, 3, 2, 4, 1, 3, 3, 3, 2, 3, 3, 4, 3, …
## $ MaritalStatus <chr> "Single", "Married", "Single", "Married", "Ma…
## $ MonthlyIncome <dbl> 5993, 5130, 2090, 2909, 3468, 3068, 2670, 269…
## $ MonthlyRate <dbl> 19479, 24907, 2396, 23159, 16632, 11864, 9964…
## $ NumCompaniesWorked <dbl> 8, 1, 6, 1, 9, 0, 4, 1, 0, 6, 0, 0, 1, 0, 5, …
## $ Over18 <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", …
## $ OverTime <chr> "Yes", "No", "Yes", "Yes", "No", "No", "Yes",…
## $ PercentSalaryHike <dbl> 11, 23, 15, 11, 12, 13, 20, 22, 21, 13, 13, 1…
## $ PerformanceRating <dbl> 3, 4, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 3, 3, 3, …
## $ RelationshipSatisfaction <dbl> 1, 4, 2, 3, 4, 3, 1, 2, 2, 2, 3, 4, 4, 3, 2, …
## $ StandardHours <dbl> 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 8…
## $ StockOptionLevel <dbl> 0, 1, 0, 0, 1, 0, 3, 1, 0, 2, 1, 0, 1, 1, 0, …
## $ TotalWorkingYears <dbl> 8, 10, 7, 8, 6, 8, 12, 1, 10, 17, 6, 10, 5, 3…
## $ TrainingTimesLastYear <dbl> 0, 3, 3, 3, 3, 2, 3, 2, 2, 3, 5, 3, 1, 2, 4, …
## $ WorkLifeBalance <dbl> 1, 3, 3, 3, 3, 2, 2, 3, 3, 2, 3, 3, 2, 3, 3, …
## $ YearsAtCompany <dbl> 6, 10, 0, 8, 2, 7, 1, 1, 9, 7, 5, 9, 5, 2, 4,…
## $ YearsInCurrentRole <dbl> 4, 7, 0, 7, 2, 7, 0, 0, 7, 7, 4, 5, 2, 2, 2, …
## $ YearsSinceLastPromotion <dbl> 0, 1, 0, 3, 2, 3, 0, 0, 1, 7, 0, 0, 4, 1, 0, …
## $ YearsWithCurrManager <dbl> 5, 7, 0, 0, 2, 6, 0, 0, 8, 7, 3, 8, 3, 2, 3, …
I am going to clean the data set, as variable names are in capital letters, some variables are not really necessary, and some variables, e.g., education are given as a number rather than a more useful description
hr_cleaned <- hr_dataset %>%
clean_names() %>%
mutate(
education = case_when(
education == 1 ~ "Below College",
education == 2 ~ "College",
education == 3 ~ "Bachelor",
education == 4 ~ "Master",
education == 5 ~ "Doctor"
),
environment_satisfaction = case_when(
environment_satisfaction == 1 ~ "Low",
environment_satisfaction == 2 ~ "Medium",
environment_satisfaction == 3 ~ "High",
environment_satisfaction == 4 ~ "Very High"
),
job_satisfaction = case_when(
job_satisfaction == 1 ~ "Low",
job_satisfaction == 2 ~ "Medium",
job_satisfaction == 3 ~ "High",
job_satisfaction == 4 ~ "Very High"
),
performance_rating = case_when(
performance_rating == 1 ~ "Low",
performance_rating == 2 ~ "Good",
performance_rating == 3 ~ "Excellent",
performance_rating == 4 ~ "Outstanding"
),
work_life_balance = case_when(
work_life_balance == 1 ~ "Bad",
work_life_balance == 2 ~ "Good",
work_life_balance == 3 ~ "Better",
work_life_balance == 4 ~ "Best"
)
) %>%
select(age, attrition, daily_rate, department,
distance_from_home, education,
gender, job_role,environment_satisfaction,
job_satisfaction, marital_status,
monthly_income, num_companies_worked, percent_salary_hike,
performance_rating, total_working_years,
work_life_balance, years_at_company,
years_since_last_promotion)glimpse(hr_cleaned)## Rows: 1,470
## Columns: 19
## $ age <dbl> 41, 49, 37, 33, 27, 32, 59, 30, 38, 36, 35,…
## $ attrition <chr> "Yes", "No", "Yes", "No", "No", "No", "No",…
## $ daily_rate <dbl> 1102, 279, 1373, 1392, 591, 1005, 1324, 135…
## $ department <chr> "Sales", "Research & Development", "Researc…
## $ distance_from_home <dbl> 1, 8, 2, 3, 2, 2, 3, 24, 23, 27, 16, 15, 26…
## $ education <chr> "College", "Below College", "College", "Mas…
## $ gender <chr> "Female", "Male", "Male", "Female", "Male",…
## $ job_role <chr> "Sales Executive", "Research Scientist", "L…
## $ environment_satisfaction <chr> "Medium", "High", "Very High", "Very High",…
## $ job_satisfaction <chr> "Very High", "Medium", "High", "High", "Med…
## $ marital_status <chr> "Single", "Married", "Single", "Married", "…
## $ monthly_income <dbl> 5993, 5130, 2090, 2909, 3468, 3068, 2670, 2…
## $ num_companies_worked <dbl> 8, 1, 6, 1, 9, 0, 4, 1, 0, 6, 0, 0, 1, 0, 5…
## $ percent_salary_hike <dbl> 11, 23, 15, 11, 12, 13, 20, 22, 21, 13, 13,…
## $ performance_rating <chr> "Excellent", "Outstanding", "Excellent", "E…
## $ total_working_years <dbl> 8, 10, 7, 8, 6, 8, 12, 1, 10, 17, 6, 10, 5,…
## $ work_life_balance <chr> "Bad", "Better", "Better", "Better", "Bette…
## $ years_at_company <dbl> 6, 10, 0, 8, 2, 7, 1, 1, 9, 7, 5, 9, 5, 2, …
## $ years_since_last_promotion <dbl> 0, 1, 0, 3, 2, 3, 0, 0, 1, 7, 0, 0, 4, 1, 0…
Produce a one-page summary describing this dataset. Here is a non-exhaustive list of questions:
attrition)# 1233 employees stay while 237 left (19.2%).
hr_cleaned %>%
group_by(attrition)%>%
summarise(count = count(attrition))## # A tibble: 2 × 2
## attrition count
## <chr> <int>
## 1 No 1233
## 2 Yes 237
# to see the how attrition rate changes with years at company
prop.table(table(hr_cleaned[,c("years_at_company","attrition")]),1)[,2]%>%
barplot(main="Attrition Rate vs Years At Company")
abline(h=0.192,col="red") # avergae attrition rateage, years_at_company, monthly_income and years_since_last_promotion distributed? can you roughly guess which of these variables is closer to Normal just by looking at summary statistics?skim(hr_cleaned)| Name | hr_cleaned |
| Number of rows | 1470 |
| Number of columns | 19 |
| _______________________ | |
| Column type frequency: | |
| character | 10 |
| numeric | 9 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| attrition | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
| department | 0 | 1 | 5 | 22 | 0 | 3 | 0 |
| education | 0 | 1 | 6 | 13 | 0 | 5 | 0 |
| gender | 0 | 1 | 4 | 6 | 0 | 2 | 0 |
| job_role | 0 | 1 | 7 | 25 | 0 | 9 | 0 |
| environment_satisfaction | 0 | 1 | 3 | 9 | 0 | 4 | 0 |
| job_satisfaction | 0 | 1 | 3 | 9 | 0 | 4 | 0 |
| marital_status | 0 | 1 | 6 | 8 | 0 | 3 | 0 |
| performance_rating | 0 | 1 | 9 | 11 | 0 | 2 | 0 |
| work_life_balance | 0 | 1 | 3 | 6 | 0 | 4 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| age | 0 | 1 | 36.92 | 9.14 | 18 | 30 | 36 | 43 | 60 | ▂▇▇▃▂ |
| daily_rate | 0 | 1 | 802.49 | 403.51 | 102 | 465 | 802 | 1157 | 1499 | ▇▇▇▇▇ |
| distance_from_home | 0 | 1 | 9.19 | 8.11 | 1 | 2 | 7 | 14 | 29 | ▇▅▂▂▂ |
| monthly_income | 0 | 1 | 6502.93 | 4707.96 | 1009 | 2911 | 4919 | 8379 | 19999 | ▇▅▂▁▂ |
| num_companies_worked | 0 | 1 | 2.69 | 2.50 | 0 | 1 | 2 | 4 | 9 | ▇▃▂▂▁ |
| percent_salary_hike | 0 | 1 | 15.21 | 3.66 | 11 | 12 | 14 | 18 | 25 | ▇▅▃▂▁ |
| total_working_years | 0 | 1 | 11.28 | 7.78 | 0 | 6 | 10 | 15 | 40 | ▇▇▂▁▁ |
| years_at_company | 0 | 1 | 7.01 | 6.13 | 0 | 3 | 5 | 9 | 40 | ▇▂▁▁▁ |
| years_since_last_promotion | 0 | 1 | 2.19 | 3.22 | 0 | 0 | 1 | 3 | 15 | ▇▁▁▁▁ |
job_satisfaction and work_life_balance distributed? Don’t just report counts, but express categories as % of total?hr_cleaned %>%
group_by(job_satisfaction)%>%
summarise(countjs = n(),
percentagejs = countjs/nrow(hr_cleaned)*100)## # A tibble: 4 × 3
## job_satisfaction countjs percentagejs
## <chr> <int> <dbl>
## 1 High 442 30.1
## 2 Low 289 19.7
## 3 Medium 280 19.0
## 4 Very High 459 31.2
hr_cleaned %>%
group_by(work_life_balance)%>%
summarise(countwlb= n(),
percentagewlb = round(countwlb/nrow(hr_cleaned)*100,2))## # A tibble: 4 × 3
## work_life_balance countwlb percentagewlb
## <chr> <int> <dbl>
## 1 Bad 80 5.44
## 2 Best 153 10.4
## 3 Better 893 60.8
## 4 Good 344 23.4
ggplot(hr_cleaned, aes(x = fct_relevel(education,
"Doctor", "Master", "Bachelor",
"College", "Below College"), y = monthly_income)) +
geom_boxplot()+
labs(title = "Boxplot of Monthly Income against Education",
x = "Education",
y = "Monthly Income")+
NULLggplot(hr_cleaned, aes(x = monthly_income, y = gender)) +
geom_boxplot()+
labs(title = "Boxplot of Monthly Income against Gender",
x = "Monthly Income",
y = "Gender")+
NULLggplot(hr_cleaned, aes(x=fct_reorder(job_role,-monthly_income), y=monthly_income)) +
geom_boxplot() +
labs(title = "Boxplot of Monthly Income against Job Role",
x = "Job Role",
y = "Monthly Income")+
NULLhr_cleaned %>%
group_by(education) %>%
summarise(medianinc = median(monthly_income),
meaninc = mean(monthly_income)) %>%
ggplot(aes(x = fct_relevel(education,
"Doctor", "Master", "Bachelor",
"College", "Below College"),
y=meaninc)) +
geom_bar(stat = "identity") +
labs(title = "Average Monthly Income of Each Education Level",
x = "Education",
y = "Average Monthly Income")+
NULLggthemeshr_cleaned %>%
ggplot(aes(x=monthly_income)) +
geom_histogram(bins=30)+
facet_wrap(~fct_relevel(education,
"Doctor", "Master", "Bachelor",
"College", "Below College")) +
theme_wsj() +
NULLjob_rolehr_cleaned %>%
ggplot(aes(y=monthly_income, x=age)) +
geom_point() +
geom_smooth(method="lm")+
facet_wrap(~job_role) +
theme_wsj() +
NULLThe purpose of this exercise is to reproduce a plot using your dplyr and ggplot2 skills. Read the article The Racial Factor: There’s 77 Counties Which Are Deep Blue But Also Low-Vaxx. Guess What They Have In Common? and have a look at the attached figure.
You dont have to worry about the blue-red backgound and don’t worry about replicating it exactly, try and see how far you can get. You’re encouraged to work together if you want to and exchange tips/tricks you figured out– and even though the figure in the original article is from early July 2021, you can use the most recent data.
Some hints to get you started:
# 3154 unique fips code in election2020_results
length(unique(election2020_results$fips))## [1] 3154
# check unique values of candidate names
unique(election2020_results$candidate)## [1] "JOSEPH R BIDEN JR" "OTHER" "DONALD J TRUMP"
## [4] "JO JORGENSEN"
data <- election2020_results %>%
mutate(votes_percentage = candidatevotes/totalvotes) %>% # calculate percentage
filter(candidate=="DONALD J TRUMP", mode=="TOTAL") %>% # we only need Trump votes
inner_join(vaccinations,by = "fips") %>% # inner join with vaccinations data
inner_join(population,by = "fips") # inner join with population data
# generate graph below
# install.packages("ggpubr") # to show equation easily
library(ggpubr)
# calculate actual vaccination percentage
actual = sum(data$series_complete_yes)/sum(data$pop_estimate_2019)
data %>%
# filter out 0% vaccinated points
filter(series_complete_pop_pct>0) %>%
ggplot() +
geom_point(aes(x=votes_percentage,y=series_complete_pop_pct/100,size=pop_estimate_2019),shape=21,fill="light blue")+
# scale circle size
scale_size(range = c(0, 20)) +
# add points in the center of circles
geom_point(aes(x=votes_percentage,y=series_complete_pop_pct/100),size=0.5)+
# add regression line
geom_smooth(aes(x=votes_percentage,y=series_complete_pop_pct/100),
method="lm",linetype="dotted",colour="blue",se=FALSE)+
# add the equation of regression line
stat_regline_equation(aes(x=votes_percentage,y=series_complete_pop_pct/100),
label.y = 0.1,colour="red",fontface = "bold.italic") +
# add r square
stat_cor(aes(x=votes_percentage,y=series_complete_pop_pct/100,label = paste(..rr.label..)),
label.y = 0.05,colour="red")+
# add date
geom_text(aes(x=0.3, y=0.07,label = "09/03/2021",
fontface = "bold.italic"),colour="red")+
# add horizonal lines below
geom_hline(aes(yintercept=0.85), linetype=2) + # herd immunity line
geom_text(aes(x=0, y=0.85, label = "Herd Immunity threshold (?)",
vjust=-1, hjust=0, fontface = "bold.italic"),colour="blue") +
geom_hline(aes(yintercept=0.539), linetype=2) + # Target line
geom_text(aes(x=0, y=0.539,label = "TARGET: 53.9%",
vjust=-1,hjust=0, fontface = "bold.italic"),colour="blue") +
geom_hline(aes(yintercept=actual), linetype=2) + # actual line
geom_text(aes(x=0, y=actual, label = paste("ACTUAL: ", round(actual*100,1),"%"),
vjust=-1,hjust=0,fontface = "bold.italic"),
colour="blue") +
# adjuct grid lines
scale_x_continuous(label=scales::percent_format(accuracy = 1),
breaks=seq(0,1,0.05),limits = c(0,1))+
scale_y_continuous(label=scales::percent_format(accuracy = 1),
breaks=seq(0,1,0.05),limits = c(0,1))+
geom_text(aes(x=0.5,y=1, label = "EACH U.S. COUNTY",vjust=0.5,hjust=0.5,
family="mono",fontface = "bold"),
size=5)+
labs(title = "COVID-19 VACCINATION LEVELS OUT OF TOTAL POPULATION BY COUNTY",
x = "2020 Trump Vote %",
y = "% of Total Population Vaccinated")+
# remove the legend
theme(legend.position = "none")+
NULLThe Guardian newspaper has an election poll tracker for the upcoming German election. The list of the opinion polls since Jan 2021 can be found at Wikipedia and your task is to reproduce the graph similar to the one produced by the Guardian.
The following code will scrape the wikipedia page and import the table in a dataframe.
url <- "https://en.wikipedia.org/wiki/Opinion_polling_for_the_2021_German_federal_election"
# https://www.economist.com/graphic-detail/who-will-succeed-angela-merkel
# https://www.theguardian.com/world/2021/jun/21/german-election-poll-tracker-who-will-be-the-next-chancellor
# get tables that exist on wikipedia page
tables <- url %>%
read_html() %>%
html_nodes(css="table")
# parse HTML tables into a dataframe called polls
# Use purr::map() to create a list of all tables in URL
polls <- map(tables, . %>%
html_table(fill=TRUE)%>%
janitor::clean_names())
# list of opinion polls
german_election_polls <- polls[[1]] %>% # the first table on the page contains the list of all opinions polls
slice(2:(n()-1)) %>% # drop the first row, as it contains again the variable names and last row that contains 2017 results
mutate(
# polls are shown to run from-to, e.g. 9-13 Aug 2021. We keep the last date, 13 Aug here, as the poll date
# and we extract it by picking the last 11 characters from that field
end_date = str_sub(fieldwork_date, -11),
# end_date is still a string, so we convert it into a date object using lubridate::dmy()
end_date = dmy(end_date),
# we also get the month and week number from the date, if we want to do analysis by month- week, etc.
month = month(end_date),
week = isoweek(end_date)
)
german_election_polls %>%
ggplot()+
# union
geom_point(aes(x=end_date,y=union),colour="black",shape=21)+
# span=0.2 to make the line less smoothed
geom_smooth(aes(x=end_date,y=union),se=F,colour="black",span = 0.2)+
# spd
geom_point(aes(x=end_date,y=spd),colour="red",shape=21)+
geom_smooth(aes(x=end_date,y=spd),se=F,colour="red",span = 0.2)+
# afd
geom_point(aes(x=end_date,y=af_d),colour="blue",shape=21)+
geom_smooth(aes(x=end_date,y=af_d),se=F,colour="blue",span = 0.2)+
#fdp
geom_point(aes(x=end_date,y=fdp),colour="yellow",shape=21)+
geom_smooth(aes(x=end_date,y=fdp),se=F,colour="yellow",span = 0.2)+
#grune
geom_point(aes(x=end_date,y=grune),colour="dark green",shape=21)+
geom_smooth(aes(x=end_date,y=grune),se=F,colour="dark green",span = 0.2)+
#linke
geom_point(aes(x=end_date,y=linke),colour="purple",shape=21)+
geom_smooth(aes(x=end_date,y=linke),se=F,colour="purple",span = 0.2)+
#display every month
scale_x_date(date_labels="%b %y",date_breaks ="1 month")+
labs(
x="Date",
y="Votes %"
)+
NULLThere is a lot of explanatory text, comments, etc. You do not need these, so delete them and produce a stand-alone document that you could share with someone. Knit the edited and completed R Markdown file as an HTML document (use the “Knit” button at the top of the script editor window) and upload it to Canvas.
Please seek out help when you need it, and remember the 15-minute rule. You know enough R (and have enough examples of code from class and your readings) to be able to do this. If you get stuck, ask for help from others, post a question on Slack– and remember that I am here to help too!
As a true test to yourself, do you understand the code you submitted and are you able to explain it to someone else?
Check minus (1/5): Displays minimal effort. Doesn’t complete all components. Code is poorly written and not documented. Uses the same type of plot for each graph, or doesn’t use plots appropriate for the variables being analyzed.
Check (3/5): Solid effort. Hits all the elements. No clear mistakes. Easy to follow (both the code and the output).
Check plus (5/5): Finished all components of the assignment correctly and addressed both challenges. Code is well-documented (both self-documented and with additional comments as necessary). Used tidyverse, instead of base R. Graphs and tables are properly labelled. Analysis is clear and easy to follow, either because graphs are labeled clearly or you’ve written additional text to describe how you interpret the output.